Exploring meta-data of the human vaginal microbiome

Group 6

Alberte Englund
Mathilde Due
Line Winther Gormsen
Sigrid Frandsen
Kristine Johansen

STUDY DESCRIPTION

Meta-data from MGnify’s vaginal microbiome genome catalogue

  • Uncover patterns in genome quality, taxonomic composition, and ecological characteristics.

  • Identify potential patterns for diagnosing endometriosis via associated pathogens of the vaginal microbiota, based on genus:

    • Anaerococcus, Ureaplasma, Gardnerella, Veillonella, Corynebacterium, Peptoniphilus, Candida, Alloscardovia [1]

DATA CLEANING AND WRANGLING

Untidy –> tidy data

  1. Split the “Lineage” variable into seven variables, one per taxonomic rank.
  2. Strip rank prefixes (e.g. “d__”) and GTDB suffixes (e.g. “_A”) to streamline the taxonomy.
  3. Each variable occupies a column, and each observation occupies a row.
  4. Convert all “not provided” values and empty strings to NA.
  5. Remove columns that will not be used in the analysis.
  6. Join MAG quality variables (completeness quality, contamination quality, and overall genome quality: high, medium, low) onto the tidy data.
  7. Add a column to the MAG dataset that flags whether each MAG belongs to an endometriosis-associated genus (Yes/No).
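
Steps 1, 2 and 4 can be sketched roughly as follows. This is a minimal illustration on two made-up lineage strings in the GTDB style, not the project's actual cleaning script; the toy tibble, the `ranks` vector and the regular expressions are assumptions chosen for the example.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy input: hypothetical lineage strings in the same style as the raw data
untidy <- tibble::tibble(
  Genome  = c("G1", "G2"),
  Lineage = c(
    "d__Bacteria;p__Bacillota_A;c__Clostridia;o__Tissierellales;f__Peptoniphilaceae;g__Peptoniphilus;s__Peptoniphilus_lacrimalis",
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__"
  )
)

ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")

tidy <- untidy |>
  # Step 1: one column per taxonomic rank
  separate(Lineage, into = ranks, sep = ";") |>
  mutate(
    # Step 2a: drop rank prefixes such as "d__"
    across(all_of(ranks), \(x) str_remove(x, "^[a-z]__")),
    # Step 2b: drop GTDB suffixes such as "_A" at the end of a name
    across(all_of(ranks), \(x) str_remove_all(x, "_[A-Z]+(?=\\s|$)")),
    # Step 4: empty strings and "not provided" become NA
    across(all_of(ranks), \(x) na_if(na_if(x, ""), "not provided"))
  )

print(tidy)
```

The suffix pattern only removes an underscore followed by capital letters at the end of a name, so an underscore inside a species label is left alone.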
print(readr::read_tsv(here::here("data/_raw/genomes-all_metadata.tsv")))
# A tibble: 618 × 20
   Genome        Genome_type  Length N_contigs    N50 GC_content Completeness
   <chr>         <chr>         <dbl>     <dbl>  <dbl>      <dbl>        <dbl>
 1 MGYG000303700 MAG          678213         2 466332       47.8         63.7
 2 MGYG000303701 MAG         1500176        18 112881       42.4         87.8
 3 MGYG000303702 MAG         1210062        44  48790       26.4         94.8
 4 MGYG000303703 MAG         1706016        27  89653       44.6         93.7
 5 MGYG000303704 MAG          703182         7 111709       47.8         63.7
 6 MGYG000303705 MAG         2542045       112  34925       48           97.9
 7 MGYG000303706 MAG         1449687       185  10153       34.8         85.2
 8 MGYG000303707 MAG         1874692        90  28768       37.1         99.0
 9 MGYG000303708 MAG         1480380        12 169949       42.2         87.6
10 MGYG000303709 MAG          694644        57  15063       47.9         62.0
# ℹ 608 more rows
# ℹ 13 more variables: Contamination <dbl>, rRNA_5S <dbl>, rRNA_16S <dbl>,
#   rRNA_23S <dbl>, tRNAs <dbl>, Genome_accession <chr>, Species_rep <chr>,
#   Lineage <chr>, Sample_accession <chr>, Study_accession <chr>,
#   Country <chr>, Continent <chr>, FTP_download <chr>
untidy_data <- readr::read_tsv(
  here::here("data/_raw/genomes-all_metadata.tsv"))
print(
  untidy_data |>
    dplyr::select(Lineage))
# A tibble: 618 × 1
   Lineage                                                                      
   <chr>                                                                        
 1 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
 2 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
 3 d__Bacteria;p__Bacillota;c__Bacilli;o__Staphylococcales;f__Gemellaceae;g__Ge…
 4 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
 5 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
 6 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidacea…
 7 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Tissierellales;f__Peptoniphilace…
 8 d__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g…
 9 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
10 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
# ℹ 608 more rows
print(readr::read_tsv(here::here("data/02_dat_clean.tsv")))
# A tibble: 618 × 21
   Genome        Genome_type  Length N_contigs    N50 GC_content Completeness
   <chr>         <chr>         <dbl>     <dbl>  <dbl>      <dbl>        <dbl>
 1 MGYG000303700 MAG          678213         2 466332       47.8         63.7
 2 MGYG000303701 MAG         1500176        18 112881       42.4         87.8
 3 MGYG000303702 MAG         1210062        44  48790       26.4         94.8
 4 MGYG000303703 MAG         1706016        27  89653       44.6         93.7
 5 MGYG000303704 MAG          703182         7 111709       47.8         63.7
 6 MGYG000303705 MAG         2542045       112  34925       48           97.9
 7 MGYG000303706 MAG         1449687       185  10153       34.8         85.2
 8 MGYG000303707 MAG         1874692        90  28768       37.1         99.0
 9 MGYG000303708 MAG         1480380        12 169949       42.2         87.6
10 MGYG000303709 MAG          694644        57  15063       47.9         62.0
# ℹ 608 more rows
# ℹ 14 more variables: Contamination <dbl>, rRNA_5S <dbl>, rRNA_16S <dbl>,
#   rRNA_23S <dbl>, tRNAs <dbl>, Country <chr>, Continent <chr>, Domain <chr>,
#   Phylum <chr>, Class <chr>, Order <chr>, Family <chr>, Genus <chr>,
#   Species <chr>
tidy_data <- readr::read_tsv(
  here::here("data/02_dat_clean.tsv"))
print(
  tidy_data |>
    dplyr::select(Domain, Phylum, Class, Order, Family, Genus, Species))
# A tibble: 618 × 7
   Domain   Phylum          Class           Order           Family Genus Species
   <chr>    <chr>           <chr>           <chr>           <chr>  <chr> <chr>  
 1 Bacteria Patescibacteria Saccharimonadia Saccharimonada… Nanop… Nano… Nanope…
 2 Bacteria Bacillota       Clostridia      Saccharofermen… Fasti… KA00… KA0027…
 3 Bacteria Bacillota       Bacilli         Staphylococcal… Gemel… Geme… Gemell…
 4 Bacteria Bacillota       Clostridia      Saccharofermen… Fasti… Mage… Mageei…
 5 Bacteria Patescibacteria Saccharimonadia Saccharimonada… Nanop… Nano… Nanope…
 6 Bacteria Bacteroidota    Bacteroidia     Bacteroidales   Bacte… Prev… <NA>   
 7 Bacteria Bacillota       Clostridia      Tissierellales  Pepto… Pept… Pepton…
 8 Bacteria Bacillota       Bacilli         Lactobacillales Lacto… Lact… Lactob…
 9 Bacteria Bacillota       Clostridia      Saccharofermen… Fasti… KA00… KA0027…
10 Bacteria Patescibacteria Saccharimonadia Saccharimonada… Nanop… Nano… Nanope…
# ℹ 608 more rows
options(width = 200)
Aug_data <- readr::read_tsv(
  here::here("data/03_dat_aug.tsv"))
print(
  Aug_data |>
    dplyr::select(Completeness_quality, Contamination_quality,
                  Overall_quality, endometriosis_associated),
  width = Inf)
# A tibble: 618 × 4
   Completeness_quality Contamination_quality Overall_quality endometriosis_associated
   <chr>                <chr>                 <chr>           <chr>                   
 1 Medium               High                  Medium          No                      
 2 Medium               High                  Medium          No                      
 3 High                 High                  High            No                      
 4 High                 High                  High            No                      
 5 Medium               High                  Medium          No                      
 6 High                 High                  High            No                      
 7 Medium               High                  Medium          Yes                     
 8 High                 High                  High            No                      
 9 Medium               High                  Medium          No                      
10 Medium               High                  Medium          No                      
# ℹ 608 more rows
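
The four augmented columns above could be derived along these lines. This is a hedged sketch: the high-quality cut-offs (above 90% completeness, below 5% contamination) come from the quality analysis, whereas the medium/low cut-offs (50% completeness, 10% contamination) and the rule that overall quality follows the weaker of the two categories are assumptions made for illustration. The toy genomes and their values are invented.

```r
library(dplyr)

# Genera listed in the study description as endometriosis-associated
endo_genera <- c("Anaerococcus", "Ureaplasma", "Gardnerella", "Veillonella",
                 "Corynebacterium", "Peptoniphilus", "Candida", "Alloscardovia")

# Toy MAG table (hypothetical values)
mags <- tibble::tibble(
  Genome        = c("G1", "G2", "G3"),
  Completeness  = c(63.7, 94.8, 85.2),
  Contamination = c(0.5, 1.2, 2.0),
  Genus         = c("Gemella", "Lactobacillus", "Peptoniphilus")
)

augmented <- mags |>
  mutate(
    Completeness_quality = case_when(
      Completeness > 90  ~ "High",
      Completeness >= 50 ~ "Medium",   # assumed medium cut-off
      TRUE               ~ "Low"
    ),
    Contamination_quality = case_when(
      Contamination < 5  ~ "High",
      Contamination < 10 ~ "Medium",   # assumed medium cut-off
      TRUE               ~ "Low"
    ),
    # Assumed rule: overall quality is the weaker of the two categories
    Overall_quality = case_when(
      Completeness_quality == "High" & Contamination_quality == "High" ~ "High",
      Completeness_quality == "Low"  | Contamination_quality == "Low"  ~ "Low",
      TRUE ~ "Medium"
    ),
    endometriosis_associated = if_else(Genus %in% endo_genera, "Yes", "No")
  )

print(augmented, width = Inf)
```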

DATA DESCRIPTION

  • 618 vaginal metagenome-assembled genomes (MAGs)
  • 25 variables covering taxonomy, assembly quality, and geography
  • High completeness and low contamination for most genomes
  • Dataset dominated by a few major bacterial phyla
  • Genome lengths fall within biologically expected ranges

Most MAGs belong to only a few dominant phyla.
This indicates strong taxonomic skew in the dataset.


Most genomes have high completeness (>90%),
indicating generally strong assembly quality.


Genome lengths fall within the expected biological range
for vaginal bacterial taxa (typically 1.5–3 Mb).

ANALYSIS 1 - Phylogenetic tree

  • Constructed a phylogenetic tree from the taxonomic ranks (Domain → Species) in the augmented dataset.
  • NA values in taxonomy were removed before tree construction.
  • The tree is colored by phylum to show taxonomic clustering.
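
One possible way to turn taxonomic ranks into a tree is shown below. It assumes the `ape` package and its formula interface; the slides do not state which package was used, so treat this as an illustration only. The toy table is invented and uses fewer ranks than the real data.

```r
library(ape)

# Toy taxonomy table (hypothetical rows); columns are factors for as.phylo()
taxa <- data.frame(
  Domain  = "Bacteria",
  Phylum  = c("Bacillota", "Bacillota", "Bacteroidota", "Actinomycetota"),
  Genus   = c("Lactobacillus", "Gemella", "Prevotella", "Gardnerella"),
  Species = c("Lactobacillus_crispatus", "Gemella_haemolysans",
              "Prevotella_bivia", "Gardnerella_vaginalis"),
  stringsAsFactors = TRUE
)

# Drop rows with missing ranks before building the tree
taxa <- na.omit(taxa)

# Nested formula: each rank is nested within the rank to its left
tree <- as.phylo(~Domain/Phylum/Genus/Species, data = taxa)

print(tree$tip.label)
```

A tree built this way reflects the taxonomic hierarchy only; it carries no branch lengths estimated from sequence data.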

ANALYSIS 2 - Data quality

Investigating the quality of the MAGs.

  • Scatterplot of completeness (%) vs. contamination (%) for all MAGs
  • The dashed lines mark the high-quality thresholds (above 90% completeness and below 5% contamination)
  • Points are colored by phylum
  • Most MAGs cluster in the high-quality area, indicating good assembly quality.
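
A plot of this kind can be sketched with ggplot2 as below. The data frame is a toy stand-in with invented values, and the styling is an assumption rather than the figure shown on the slide.

```r
library(ggplot2)

# Toy MAG table (hypothetical values)
mags <- data.frame(
  Completeness  = c(63.7, 94.8, 97.9, 85.2),
  Contamination = c(0.5, 1.2, 6.3, 2.0),
  Phylum        = c("Patescibacteria", "Bacillota", "Bacteroidota", "Bacillota")
)

p <- ggplot(mags, aes(x = Completeness, y = Contamination, colour = Phylum)) +
  geom_point() +
  # Dashed lines at the high-quality thresholds: >90% complete, <5% contaminated
  geom_vline(xintercept = 90, linetype = "dashed") +
  geom_hline(yintercept = 5, linetype = "dashed") +
  labs(x = "Completeness (%)", y = "Contamination (%)")

print(p)
```

High-quality MAGs fall in the lower-right quadrant, to the right of the vertical line and below the horizontal one.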

ANALYSIS 3 - Endometriosis-associated and non-associated MAGs

  • Compared endometriosis-associated vs non-associated MAGs
  • Focused on GC content, genome length, completeness & contamination
  • Investigated whether associated MAGs cluster taxonomically
  • Goal: determine if associated MAGs form a genomically distinct group

Endometriosis-associated MAGs occur in only a few phyla.
Most phyla contain no associated MAGs, so the associated MAGs are confined to a limited set of taxa.


GC content ranges overlap almost completely.
No evidence that GC% distinguishes associated vs non-associated MAGs.

ANALYSIS 4 - Species Distribution between countries

  • Investigated the distribution of lineage groups across countries
  • Counted group instances for each country and pivoted to wide format
  • Removed genomes with NA in Country
  • Big difference in sample size –> normalize
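
The counting and normalisation steps could look roughly like this. It is a sketch on an invented table; using within-country proportions as the normalisation is an assumption, since the slides do not say which method was used.

```r
library(dplyr)
library(tidyr)

# Toy table (hypothetical rows): one row per genome
mags <- tibble::tibble(
  Country = c("USA", "USA", "USA", "China", "Denmark", NA),
  Order   = c("Lactobacillales", "Lactobacillales", "Bacteroidales",
              "Bacteroidales", "Lactobacillales", "Bacteroidales")
)

props <- mags |>
  filter(!is.na(Country)) |>            # drop genomes without a country
  count(Country, Order) |>              # group instances per country
  group_by(Country) |>
  mutate(prop = n / sum(n)) |>          # normalise for unequal sample sizes
  ungroup() |>
  select(-n) |>
  pivot_wider(names_from = Order, values_from = prop, values_fill = 0)

print(props)
```

Each row of `props` sums to 1, so countries with many genomes and countries with few are put on the same scale.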

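Some variation between countries, e.g. in order Bacteroidales, but not much.

This variation could have been tested for significance.

Only two principal components, together explaining 100% of the variance (expected, since three countries span at most two components).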

Clear division of countries.

PC1 is driven mainly by Fusobacteria and Bacteroidota (order Bacteroidales), consistent with the heatmap.
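
The reason two components capture all the variance can be checked on a small example. The proportions below are invented; the point is only that a PCA on three observations (countries) has at most two non-zero components after centering.

```r
# Toy matrix of normalised proportions: rows are countries, columns are orders
# (hypothetical values; each row sums to 1)
props <- matrix(
  c(0.60, 0.30, 0.10,
    0.20, 0.50, 0.30,
    0.40, 0.10, 0.50),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("USA", "China", "Denmark"),
                  c("Lactobacillales", "Bacteroidales", "Fusobacteriales"))
)

pca <- prcomp(props, center = TRUE, scale. = FALSE)

# Share of variance per component; the third is numerically zero
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings show which orders drive each component
print(round(pca$rotation[, 1:2], 3))
```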

Tested whether the proportion of endometriosis-associated genomes differs significantly between countries.

Fisher's exact test results

country1  country2  p_value  odds_ratio  CI_low  CI_high
USA       China      0.0205       2.670   1.113    7.779
USA       Denmark    0.1554       4.648   0.738  193.498
Denmark   China      1.0000       0.576   0.012    5.077
  • Significant difference between USA and China (p = 0.0205)
  • Genomes from the USA had 2.7 times higher odds of being endometriosis-associated than genomes from China
  • No significant difference for USA vs. Denmark or Denmark vs. China
  • Likely due to the limited number of Danish and Chinese genomes represented
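
A single pairwise comparison of this kind can be sketched with base R's `fisher.test()`. The counts below are invented for two hypothetical countries and are not the study's data; only the structure of the 2 × 2 table carries over.

```r
# Hypothetical 2 x 2 table: associated vs. non-associated genomes per country
counts <- matrix(
  c(30, 70,    # country A: associated, not associated
    10, 90),   # country B: associated, not associated
  nrow = 2,
  dimnames = list(Associated = c("Yes", "No"),
                  Country    = c("A", "B"))
)

res <- fisher.test(counts)

# Collect the same fields as in the results table above
result <- data.frame(
  country1   = "A",
  country2   = "B",
  p_value    = res$p.value,
  odds_ratio = unname(res$estimate),
  CI_low     = res$conf.int[1],
  CI_high    = res$conf.int[2]
)
print(result)
```

With three pairwise tests, a multiple-testing correction such as `p.adjust()` would be a natural next step.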

DISCUSSION & FUTURE PERSPECTIVES

  • High-quality MAGs with good completeness

  • No strong genomic differences between groups

  • Limited metadata and uneven sampling

  • Improve clinical + geographic metadata

  • More focus on theoretically correct analysis (compositional data analysis)

CONCLUSION